Add support for unbounded look-behind expressions #1266
An optimization to the bounded-length look-behinds yields large speed-ups on the benchmarks used for evaluation (up to 150 times faster):

benchmark                 python/re            rust/regex-lookbehind  rust/regex-lookbehind-new
---------                 ---------            ---------------------  -------------------------
lookbehind/snort/snort-0  2.2 GB/s (1.00x)     45.0 MB/s (50.40x)     1034.7 MB/s (2.19x)
lookbehind/snort/snort-1  204.0 MB/s (1.00x)   34.3 MB/s (5.94x)      34.1 MB/s (5.99x)
lookbehind/snort/snort-2  107.1 MB/s (71.24x)  53.0 MB/s (143.94x)    7.5 GB/s (1.00x)
lookbehind/snort/snort-3  100.7 MB/s (80.25x)  102.2 MB/s (79.08x)    7.9 GB/s (1.00x)
lookbehind/snort/snort-4  2041.9 MB/s (1.00x)  45.9 MB/s (44.52x)     967.3 MB/s (2.11x)
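To make the "up to 150 times faster" figure concrete, the per-benchmark speed-up of the optimized engine over the previous one can be derived from the parenthesized ratios in the table. The following is a minimal sketch, assuming each ratio expresses the slowdown relative to the fastest engine in its row (so 1.00x marks the fastest). The names `ROWS`, `speedup`, and `max_speedup` are illustrative and not part of rebar.

```rust
/// (benchmark, ratio of rust/regex-lookbehind, ratio of rust/regex-lookbehind-new),
/// copied from the benchmark table in this comment.
const ROWS: [(&str, f64, f64); 5] = [
    ("snort-0", 50.40, 2.19),
    ("snort-1", 5.94, 5.99),
    ("snort-2", 143.94, 1.00),
    ("snort-3", 79.08, 1.00),
    ("snort-4", 44.52, 2.11),
];

/// Speed-up of the new engine over the old one. Both ratios are assumed to be
/// relative to the same baseline (the fastest engine in the row), so the
/// baseline cancels out when dividing.
fn speedup(old_ratio: f64, new_ratio: f64) -> f64 {
    old_ratio / new_ratio
}

/// Largest old-to-new speed-up across all rows.
fn max_speedup() -> f64 {
    ROWS.iter()
        .map(|&(_, old, new)| speedup(old, new))
        .fold(f64::MIN, f64::max)
}
```

Under that assumption, the largest speed-up comes from snort-2 (about 144x), while snort-1 is essentially unchanged (a ratio just below 1).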
We have now published a write-up of our development process here. This might help in understanding some of our design choices better. It also describes some additional work on the bounded backtracker to support the same features there, and the challenges that go along with it. These changes are available on the backtracking branch of our fork.
Sorry, maybe I'm being naive, but this PR sounds like a very useful addition. Is there anything that could be done to encourage this to be accepted?
This is the first step to supporting captureless lookbehind assertions
The lack of recursing into the inner expression of a lookaround is correct under the current assumption that lookarounds cannot have capture groups. But once the restriction is lifted, this wrong implementation can be very subtle to find. Instead, we can already do the filtering and accept it being a no-op for now.
This makes it consistent with parser's ErrorKind::UnsupportedLookAround.
Thanks so much for submitting this! I think it would be a good idea to experiment with this on Instead, I think a more plausible route here is the following:
Ideally it would be broken up into more PRs than the above, but I don't know where the best logical breakpoints are besides what I listed above. And ideally, the commits wouldn't be "here is my development history," but rather, "here is a logical series of patches that can each be reviewed in relative isolation." I also want to be clear that I have not yet decided on whether this should be available in
Thank you for your feedback and for being open to experiment before commitment. We agree such a large PR is hard to review and reason about. We will come up with a more detailed plan for smaller, more focused PRs. Once the plan is done, we will be happy to receive feedback on it before we get to work.
As an example, consider the regex
(?<=Title:\s+)\w+
which would match the following strings (matches underlined with ~):
But does not match:
No heading
title: bad case
Title:nospace
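The intended semantics of this example can be spelled out without any regex engine. The following is a minimal sketch that checks the look-behind condition directly at every candidate position. It is a reference reading of the semantics, not the streaming algorithm of this PR. The function names are hypothetical, and `char::is_whitespace` and `char::is_alphanumeric` are used as approximations of the crate's Unicode `\s` and `\w` classes.

```rust
/// Look-behind condition: does `Title:\s+` match some substring that ends
/// exactly at the end of `prefix`?
fn lookbehind_holds(prefix: &str) -> bool {
    // `\s+` has to cover the whole trailing whitespace run, because the
    // character right before that run must be the ':' of "Title:".
    let trimmed = prefix.trim_end_matches(char::is_whitespace);
    trimmed.len() < prefix.len() && trimmed.ends_with("Title:")
}

/// Approximation of `\w` (an assumption, see the text above).
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Leftmost match of `(?<=Title:\s+)\w+`, returned as a byte range.
fn find_title_word(hay: &str) -> Option<(usize, usize)> {
    for (start, c) in hay.char_indices() {
        if is_word(c) && lookbehind_holds(&hay[..start]) {
            // Greedily extend over the run of word characters.
            let len: usize = hay[start..]
                .chars()
                .take_while(|&c| is_word(c))
                .map(char::len_utf8)
                .sum();
            return Some((start, start + len));
        }
    }
    None
}
```

For a hypothetical input such as "Title: Hello", this reports the byte range of "Hello", and it reports no match for the three non-matching strings listed above. Because the whitespace run may be arbitrarily long, the look-behind is unbounded.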
What
This PR implements the streaming algorithm from
Linear Matching of JavaScript Regular Expressions (Section 4.4)
for unbounded look-behinds. The same algorithm has been
implemented and merged into V8.
The addition of look-around expressions to this crate was mentioned previously
in #1153.
This PR adds support for positive and negative look-behinds with arbitrary
nesting, subject to the limitations listed below.
Limitations
Capture groups outside of look-arounds are supported. With the current capture
group semantics, no linear-time algorithm that would allow for capture groups
inside of look-arounds is known. However, look-behinds could be implemented in
other engines and with prefilters on. Look-aheads could also be implemented with
additional memory.
How
We implemented the streaming algorithm presented in Section 4.4 of the paper
mentioned above. The algorithm works by running the sub-automata for any
look-behind expressions in parallel to the main automaton. This is achieved by
compiling the look-behind expressions as usual but storing their start states
separately, not reachable from the main automaton.
Instead of a match state, the sub-automata for look-behinds have a
WriteLookAround state. This state causes the current position in the haystack
to be recorded in a global look-around table.
The main automaton (and the sub-automata in the case of nested look-behinds) can
then read from this table by means of a CheckLookAround instruction and
compare the stored index with the current position in the haystack. These states
work as conditional epsilon transitions, similar to the already supported "look"
assertions (e.g. ^, \b, $).
PikeVM's cache has been expanded to preserve good performance of single-match
searches (stop the look-around threads once the main automaton finishes) and of
all-matches searches (remember the look-around states when resuming a search to
prevent having to rescan the haystack from the beginning).
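The interplay between the look-around table and the two kinds of states can be illustrated on the example regex (?<=Title:\s+)\w+. The following is a simplified sketch, not the code of this PR: the look-behind sub-automaton is determinized by hand instead of being simulated as NFA threads, the main automaton is reduced to a scan for a run of word characters, and all names (`LookTable`, `step_lookbehind`, `find_title_word`) are hypothetical.

```rust
/// Global look-around table: one slot per look-behind, holding the most
/// recent haystack position at which that look-behind matched.
struct LookTable {
    slots: Vec<Option<usize>>,
}

impl LookTable {
    fn new(look_count: usize) -> Self {
        LookTable { slots: vec![None; look_count] }
    }

    /// Analogue of a WriteLookAround state: record the current position.
    fn write(&mut self, look: usize, pos: usize) {
        self.slots[look] = Some(pos);
    }

    /// Analogue of a CheckLookAround instruction: a conditional epsilon
    /// transition comparing the stored index with the current position.
    /// `positive = false` models a negative look-behind.
    fn check(&self, look: usize, pos: usize, positive: bool) -> bool {
        (self.slots[look] == Some(pos)) == positive
    }
}

/// Approximation of `\w` (an assumption of this sketch).
fn is_word(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Hand-determinized, unanchored sub-automaton for `Title:\s+`.
/// States 0..=6 count matched characters of "Title:"; state 7 is accepting.
fn step_lookbehind(state: usize, c: char) -> usize {
    const LIT: [char; 6] = ['T', 'i', 't', 'l', 'e', ':'];
    if state < 6 && c == LIT[state] {
        state + 1
    } else if state >= 6 && c.is_whitespace() {
        7
    } else if c == 'T' {
        1 // restart: 'T' only occurs at the start of the literal
    } else {
        0
    }
}

/// Leftmost match of `(?<=Title:\s+)\w+` in one pass over the haystack.
fn find_title_word(hay: &str) -> Option<(usize, usize)> {
    let mut table = LookTable::new(1);
    let mut lb_state = 0; // sub-automaton has consumed hay[..pos]
    let mut start: Option<usize> = None; // start of the current \w+ run
    for (pos, c) in hay.char_indices() {
        if lb_state == 7 {
            table.write(0, pos); // the look-behind holds at this position
        }
        if let Some(s) = start {
            if !is_word(c) {
                return Some((s, pos)); // main automaton done: stop everything
            }
        } else if is_word(c) && table.check(0, pos, true) {
            start = Some(pos);
        }
        lb_state = step_lookbehind(lb_state, c);
    }
    start.map(|s| (s, hay.len()))
}
```

Each haystack character is processed once by the sub-automaton and once by the main scan, which is the property that keeps the approach linear. An all-matches search built on this sketch would keep `lb_state` and the table between calls instead of restarting from the beginning of the haystack.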
Testing
We have added unit tests for the new functionality in the individual test
modules to test the new parsing, translation, and compilation features. We have
further added integration tests in the form of a new toml file. All engines
apart from the PikeVM will reject look-behind expressions. Thus tests containing
look-around expressions are filtered out for engines other than the PikeVM and
Meta engine.
Future Work
We would love to get feedback on the implementation.
The next steps are to work on the current limitations. Namely, implement support
in more engines and enable prefilters. Additionally, support for look-aheads
would be implemented if the additional memory cost is acceptable.
We are open to discussion about any of the above.
Performance
We forked rebar and added a new engine definition (rust/regex-lookbehind) for
our fork of regex. We added this new engine definition to all benchmarks where
rust/regex was already present. Furthermore, we added some benchmark
definitions to measure the performance of the look-behind algorithm.
We ran the full suite of benchmarks twice and merged the results. They are available
in our rebar fork (results_full_combined.csv).
Results without look-behinds
The results from all benchmarks without look-behinds show that our changes do not
introduce a significant slowdown for regexes that were already supported:
Note: We noticed a discrepancy of up to 1.51 across multiple runs when comparing the
current version of rust/regex:
Given this run-to-run variance, we conclude that, despite the highest speedup ratio
being 1.57 when comparing both engines across both runs, the results of all individual
benchmarks further strengthen the claim that our changes do not significantly impact
performance.
Full benchmark comparison (without look-behinds)
Results with look-behinds
To get an estimate for the performance of "real-world regexes" using look-behinds,
we extracted all regexes that contain look-behind expressions from the snort
ruleset. We chose this as a source of regexes because it has been used as a
benchmark for look-arounds before in Efficient Matching of Regular Expressions with Lookaround Assertions.
Unfortunately, this ruleset is licensed in a way that prohibits us from
distributing it. See the reproduction section below to learn where to get the
ruleset from and how to extract the regexes.
Furthermore, we wrote a couple of very simple benchmarks to demonstrate that
our implementation respects linearity.
We chose to compare our implementation to python/re, as this engine is readily
available, hence easy to benchmark, and used ubiquitously. Note, however, that
python/re only supports bounded-length look-behinds, while our implementation
supports unbounded ones as well.
Look-behind benchmark comparison
A few things to note:
- snort-0 and snort-4 are the only ones where there is an opportunity for
  prefiltering based on a prefix literal, which we haven't implemented currently.
  This explains the huge difference in speedup compared to all other regexes.
- The speedup ratio between python/re and rust/* is similar to the values seen
  here (e.g. imported/sherlock/everything-greedy-nl, curated/08-words/long-russian).
  We therefore conclude that the baseline performance for regexes with
  look-behinds is reasonable.
- The linear-haystack benchmarks show that our algorithm indeed runs in linear time.
How to reproduce
Please follow these instructions to reproduce our results:
1. Download snapshot 3200 of the rules in the "Registered" column.
2. Clone our rebar fork.
3. Place snortrules-snapshot-3200 in the root of the cloned repo.
4. See benchmark_lookbehind.sh for the prerequisites. If you are on a
   debian/ubuntu system, you can install them easily by running
   ./benchmark_lookbehind.sh --install (requires root privileges).
5. Run ./benchmark_lookbehind.sh to run the benchmark.
6. The results are written to results_full.csv and results_lookbehind.csv,
   which are placed in the directory containing the rebar fork.
Acknowledgements
This was a joint effort by @shilangyu and @Multimodcrafter, supervised by Aurèle Barrière and Clément Pit-Claudel at EPFL's SYSTEMF.